Biniam Abebe - 04/27/2024

Hands-on Assignment

Complete the following two sections on Supervised Machine Learning:

CART and k-Nearest Neighbors

Part 1: CART

Supervised Machine Learning CART

[Image: SL.jpg]

STEP 1: Import Libraries

WORKFLOW: DATA SET

STEP 2: Read data description and Load the Data

Description of Boston Housing Dataset

We will investigate the Boston House Price dataset as you did with the linear regression homework. Each record in the database describes a Boston suburb or town. The data was drawn from the Boston Standard Metropolitan Statistical Area (SMSA) in 1970. The attributes are defined as follows:

Note: For this assignment, we use a subset of the original dataset.

WORKFLOW: Clean and Preprocess the Dataset

STEP 3: Clean the data

STEP 4: Performing the Exploratory Data Analysis (EDA)

STEP 4A: Create Histograms

STEP 4B: Create Scatter Plots

STEP 4C: Joint Plots with Seaborn

IMPORTANT NOTE: You can find more information on joint plots here http://seaborn.pydata.org/generated/seaborn.jointplot.html

WORKFLOW: DATA SPLIT

STEP 5: Separate the Dataset into Input & Output NumPy Arrays

STEP 6: Split Input/Output Arrays into Training/Testing Datasets
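Steps 5 and 6 can be sketched as follows. This is a minimal illustration only: the random array stands in for the cleaned dataset, and the column count, `test_size`, and seed are assumptions rather than the values used in the assignment.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the cleaned dataset: 100 rows, 4 predictor
# columns followed by 1 target column (the real column count may differ).
rng = np.random.default_rng(seed=7)
array = rng.normal(size=(100, 5))

# Step 5: the last column is the output, everything before it is input.
X = array[:, :-1]
Y = array[:, -1]

# Step 6: hold out a quarter of the rows for testing (assumed proportion).
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.25, random_state=7
)

print(X_train.shape, X_test.shape)
```

Fixing `random_state` makes the split repeatable, so scores from different runs can be compared.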

WORKFLOW: TRAIN MODEL

STEP 7: Build and Train the Model
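For Part 1 the model is a CART regression tree. A minimal sketch, assuming scikit-learn's `DecisionTreeRegressor` and made-up training data (the array shapes and the seed are illustrative, not taken from the assignment):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Made-up training data: 80 rows, 3 predictors, and a noisy numeric target.
rng = np.random.default_rng(seed=7)
X_train = rng.uniform(0, 10, size=(80, 3))
Y_train = 2.0 * X_train[:, 0] + rng.normal(scale=0.5, size=80)

# Build the tree and fit it to the training arrays.
model = DecisionTreeRegressor(random_state=7)
model.fit(X_train, Y_train)

# Inspect the size of the fitted tree.
print(model.get_depth(), model.get_n_leaves())
```

With default settings the tree grows until its leaves are pure, so it tends to fit the training data almost perfectly; the test-set score in the next step is the more honest measure.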

WORKFLOW: SCORE MODEL

STEP 8: Calculate R-Squared

Note: The higher the R-squared, the better (typically 0 – 100%). Depending on the model, the best models score above 83%. The R-squared value tells us how well the independent variables predict the dependent variable. Here the R-squared is very low. Think about how you could increase it. What variables would you use?
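To make the score concrete, here is a small sketch of how R-squared is computed, using made-up numbers. The helper `r_squared` is hypothetical and written for illustration; for regressors, scikit-learn's `model.score(X_test, Y_test)` reports the same quantity. Note that on held-out data the value can fall below 0 when predictions are worse than simply guessing the mean.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # unexplained
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # total
    return 1.0 - ss_res / ss_tot

y_true = [10.0, 20.0, 30.0, 40.0]   # made-up target values
good = [12.0, 18.0, 31.0, 39.0]     # predictions close to the targets
poor = [40.0, 10.0, 35.0, 15.0]     # predictions far from the targets

print(r_squared(y_true, good))  # close to 1
print(r_squared(y_true, poor))  # below 0
```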

STEP 9: Prediction

We are using the following predictors for the 1st prediction:

Notes: The model predicts that the median value of owner-occupied homes in the above suburb is about $12,600 (the target is expressed in thousands of dollars).

We are using the following predictors for the 2nd prediction:

Notes: The model predicts that the median value of owner-occupied homes in the above suburb is about $15,700 (the target is expressed in thousands of dollars).
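A prediction like the two above can be sketched as follows. The training data and the predictor values for `new_suburb` are made up for illustration and are not the assignment's predictors; the point is the shape of the input and the conversion from thousands of dollars.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Made-up training data; the target is in thousands of dollars.
rng = np.random.default_rng(seed=7)
X = rng.uniform(0, 10, size=(80, 3))
Y = 5.0 + 2.0 * X[:, 0] + rng.normal(scale=0.5, size=80)
model = DecisionTreeRegressor(random_state=7).fit(X, Y)

# predict() expects a 2-D array: one row per suburb, one column per predictor.
new_suburb = np.array([[4.0, 1.5, 7.2]])  # hypothetical predictor values
predicted_thousands = model.predict(new_suburb)[0]

# Convert from thousands of dollars to dollars for reporting.
print(f"Predicted median value: ${predicted_thousands * 1000:,.0f}")
```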

WORKFLOW: EVALUATE MODELS

STEP 10: Train & Score Model 2 Using K-Fold Cross Validation Data Split

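Notes: After we train, we evaluate. We use K-fold cross-validation to determine whether the model is acceptable, and we pass the whole dataset because the procedure divides it into folds for us. Mean squared error is traditionally a positive value, but scikit-learn reports it as a negative value (Negative Mean Squared Error) so that a higher score is always better. To get a positive number, negate the Negative Mean Squared Error. Taking the square root of the negated value then gives the root mean squared error (RMSE), which is in the same units as the target.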
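The sign handling described in the note can be sketched like this. The data is made up, and the number of folds and the seed are assumptions; `scoring="neg_mean_squared_error"` is the scikit-learn option that produces the negative values.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Made-up dataset: the whole set is passed in; KFold does the splitting.
rng = np.random.default_rng(seed=7)
X = rng.uniform(0, 10, size=(100, 3))
Y = 5.0 + 2.0 * X[:, 0] + rng.normal(scale=0.5, size=100)

kfold = KFold(n_splits=10, shuffle=True, random_state=7)
scores = cross_val_score(
    DecisionTreeRegressor(random_state=7), X, Y,
    cv=kfold, scoring="neg_mean_squared_error",
)

# Each score is a negated MSE, so flip the sign before taking the root.
mse = -scores.mean()
rmse = np.sqrt(mse)
print(f"Mean negative MSE: {scores.mean():.3f}")
print(f"MSE: {mse:.3f}, RMSE: {rmse:.3f}")
```

Taking the square root before negating would fail, because the reported scores are negative.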


Part 2: k-Nearest Neighbors (kNN)

Supervised Machine Learning k-Nearest Neighbors (kNN)

• Let's begin Part 2 using the same Supervised Learning Workflow used in Part 1.

STEP 1: Import Libraries

WORKFLOW: DATA SET

STEP 2: Read data description and Load the Data

Description Iris Dataset

Data Set: iris.csv

Title: Iris Plants Database

Updated Sept 21 by C. Blake - Added discrepancy information

Sources:

Relevant Information: This is perhaps the best-known database to be found in the pattern recognition literature. Fisher's paper is a classic in the field and is referenced frequently to this day. (See Duda & Hart, for example)

The data set contains 3 classes of 50 instances each, where each class refers to a type of iris plant.

Predicted attribute: class of Iris plant

Number of Instances: 150 (50 in each of three classes)

Number of predictors: 4 numeric

Predictive attributes and the class attribute information:

class:

[Image: flower.jpg]

WORKFLOW: Clean and Preprocess the Dataset

STEP 3: Clean the data

STEP 4: Performing the Exploratory Data Analysis (EDA)

STEP 4A: Create Histograms

STEP 4B: Create Density Plots

STEP 4C: Create Boxplots

STEP 4D: Create Scatter Plots

WORKFLOW: DATA SPLIT

STEP 5: Separate the Dataset into Input & Output NumPy Arrays

STEP 6: Split Input/Output Arrays into Training/Testing Datasets

WORKFLOW: TRAIN MODEL

STEP 7: Build and Train the Model

WORKFLOW: SCORE MODEL 1

STEP 8: Score the Accuracy of the Model
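A minimal sketch of training a kNN classifier and scoring its accuracy. It assumes scikit-learn's bundled copy of the Iris data (`load_iris`) as a stand-in for `iris.csv`, and the split proportion, seed, and default `n_neighbors=5` are illustrative choices rather than the assignment's settings.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Stand-in for iris.csv: 150 rows, 4 numeric predictors, 3 classes.
X, Y = load_iris(return_X_y=True)
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.33, random_state=7
)

# Fit kNN with the default of 5 neighbors.
model = KNeighborsClassifier()
model.fit(X_train, Y_train)

# For classifiers, score() returns the fraction of correct test predictions.
accuracy = model.score(X_test, Y_test)
print(f"Accuracy: {accuracy:.1%}")
```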

STEP 9: Prediction

Note: We have now trained the model and are using it to predict the type of flower from the listed values of each variable.
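Such a prediction can be sketched as below. The measurements in `new_flower` are illustrative values, and `load_iris` is again assumed as a stand-in for `iris.csv`; its class labels are integers, so the sketch maps the predicted index back to a class name.

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
model = KNeighborsClassifier().fit(iris.data, iris.target)

# One row per flower, in the order: sepal length, sepal width,
# petal length, petal width (cm). Illustrative values only.
new_flower = [[5.1, 3.5, 1.4, 0.2]]

# predict() returns a class index; look up the matching class name.
predicted_index = model.predict(new_flower)[0]
predicted_name = iris.target_names[predicted_index]
print(predicted_name)
```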

WORKFLOW: EVALUATE MODELS

STEP 10: Train & Score Model 2 Using K-Fold Cross Validation Data Split
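For the classifier, K-fold cross-validation can be scored with accuracy instead of mean squared error. A minimal sketch, assuming `load_iris` as a stand-in for `iris.csv` and 10 folds (both assumptions):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Pass the whole dataset; KFold handles the train/test division.
X, Y = load_iris(return_X_y=True)

# shuffle=True matters here because the rows are ordered by class.
kfold = KFold(n_splits=10, shuffle=True, random_state=7)
scores = cross_val_score(
    KNeighborsClassifier(), X, Y, cv=kfold, scoring="accuracy"
)

print(f"Mean accuracy: {scores.mean():.3f} (std {scores.std():.3f})")
```

Reporting the standard deviation alongside the mean shows how much the score varies from fold to fold.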

GREAT JOB! YOU ARE DONE.